feat: Switch (NVSwitch) state machine#624
feat: Switch (NVSwitch) state machine#624vinodchitraliNVIDIA wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-03-18 20:48:13 UTC | Commit: a9b5e23 |
🛡️ Vulnerability Scan🚨 Found 72 vulnerability(ies) Severity Breakdown:
🔗 View full details in Security tab 🕐 Last updated: 2026-03-18 20:48:21 UTC | Commit: a9b5e23 |
Introduce a state machine for switch : creation → configuration → validation → BOM validation → ready, with optional reprovisioning and deletion. States and flow Initializing – Switch created in Carbide; controller does initial setup. Configuring – Single sub-state RotateOsPassword; then move to Validating. Validating – Sub-state ValidateComplete; then move to BomValidating. BomValidating – Sub-state BomValidateComplete; then move to Ready. Ready – Switch usable; can be deleted or sent to ReProvisioning. ReProvisioning – Sub-states Start → WaitFirmwareUpdateCompletion; completion via firmware_upgrade_status (Completed → Ready, Failed → Error). Error – Can transition to Deleting if marked for deletion. Deleting – Removal; terminal state. Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
a9b5e23 to
333cc28
Compare
| | **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. | | ||
| | **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. | | ||
| | **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). | | ||
| | **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. | |
There was a problem hiding this comment.
And then what? What is manual intervention and which state does it go to (i.e. how do you get out of the error state)?
There was a problem hiding this comment.
agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or something like that.
The bare minimum would probably be a mechanism to start another installation of the switch OS - which then however requires another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via rack state machine.
There was a problem hiding this comment.
Failed switch state will reflect in Rack sate also. Once switch issue is fixed, on-demand rack firmware update will be triggered. This will make sure delta trays moves to ready state and hence rack too
| |-------|-------------| | ||
| | **Initializing** | Switch is created in Carbide; controller performs initial setup. | | ||
| | **Configuring** | Switch is being configured (rotate OS password). Sub-state: `RotateOsPassword`. | | ||
| | **Validating** | Switch is being validated. Sub-state: `ValidateComplete`. | |
There was a problem hiding this comment.
Be more specifically what's being validated.
There was a problem hiding this comment.
@narasimhan321 is working on set of test case. Will add them shortly
| | ReProvisioning (Start) | ReProvisioning (WaitFirmwareUpdateCompletion) | Reprovision triggered | | ||
| | ReProvisioning (WaitFirmwareUpdateCompletion) | Ready | `firmware_upgrade_status == Completed` | | ||
| | ReProvisioning (WaitFirmwareUpdateCompletion) | Error | `firmware_upgrade_status == Failed { cause }` | | ||
| | Error | Deleting | `deleted` set (marked for deletion) | |
There was a problem hiding this comment.
You should be able to get out of Error without deleting the switch.
There was a problem hiding this comment.
noted .. will change
| "switch_serial_number", | ||
| "nvos_mac_address", | ||
| "nvos_username", | ||
| "nvos_password", |
There was a problem hiding this comment.
Is this safe to keep in PG unencrypted? (i.e. are they well known defaults?)
There was a problem hiding this comment.
they are well know default simillar to BMC creds. nvos_password will be rotated during the ingestion
| @@ -0,0 +1,3 @@ | |||
| -- Add nvos_mac_address column to expected_switches table (NVOS host MAC, similar to bmc_mac_address). | |||
| ALTER TABLE expected_switches | |||
| ADD COLUMN nvos_mac_address macaddr; | |||
There was a problem hiding this comment.
Why do we need both nvos_mac_address and bmc_mac_address. You really only need expected_switches to know how to login to it's BMC.
There was a problem hiding this comment.
There is bug in nv switch redfish. In redfish oob there is no info available on cabled eth port on nvos. Without nvos eth0/1 mac address we cant assoicate explored Switch. Further more we cant upgrade firmware
And we cant install anything on nvos to collect mapping info
There was a problem hiding this comment.
should we then add both addresses into expected switches?
| .await | ||
| .map_err(|e| DatabaseError::new("end find_all_preingestion_complete data", e))?; | ||
| let mut managed_switches = Vec::new(); | ||
| for ep in explored_endpoints.into_iter() { |
There was a problem hiding this comment.
The .filter_map() was better. But I just hate for loops.
| // ) -> CarbideResult<()> { | ||
| // //TODO Add this later when and if required | ||
| // Ok(()) | ||
| // } |
| _state: &mut Switch, | ||
| ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>, | ||
| ) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> { | ||
| tracing::info!("Deleting Switch"); |
There was a problem hiding this comment.
Make sure this log message actually says which switch is getting deleted (and what caused it). The tracing macro may annotate the log properly, but I can't tell from here.
| state: &mut Switch, | ||
| _ctx: &mut StateHandlerContext<'_, SwitchStateHandlerContextObjects>, | ||
| ) -> Result<StateHandlerOutcome<SwitchControllerState>, StateHandlerError> { | ||
| tracing::info!("Switch is in error state"); |
There was a problem hiding this comment.
Same comment about more informational log entries (and anywhere else).
| )); | ||
| } | ||
|
|
||
| if is_switch_reprovisioning_requested(state) { |
There was a problem hiding this comment.
What system is responsible for knowing that there's no users using any of the connected machines.
There was a problem hiding this comment.
the rack state machine will perform the actual update, and it would wait until all hosts would move out of Assigned/Ready into a guard state.
So this seems ok
| optional string nvos_password = 8; | ||
| // Unique identifier for the expected switch. When omitted, server generates one. | ||
| optional common.UUID expected_switch_id = 9; | ||
| optional string nvos_mac_address = 10; |
There was a problem hiding this comment.
Why do we need it? Can't it be discovered dynamically? The only reason we would need it is if we wanted to validate the MAC address is the right one.
I also think there's 2 management ports on the switch, so if we go for it, it should be 2.
| | **BomValidating** | BOM (Bill of Materials) validation. Sub-state: `BomValidateComplete`. | | ||
| | **Ready** | Switch is ready for use. From here it can be deleted, or reprovisioning can be requested. | | ||
| | **ReProvisioning** | Reprovisioning (e.g. firmware update) in progress. Sub-states: `Start`, `WaitFirmwareUpdateCompletion`. Completion is driven by `firmware_upgrade_status` (Completed → Ready, Failed → Error). | | ||
| | **Error** | Switch is in error (e.g. firmware upgrade failed). Can transition to Deleting if marked for deletion; otherwise waits for manual intervention. | |
There was a problem hiding this comment.
agreed. We need to provide the tooling to fix the problem. We can't expect anyone to manually scp something to switches or something like that.
The bare minimum would probably be a mechanism to start another installation of the switch OS - which then however requires another NMX-C cluster configuration. The tricky part here seems to be that the switch installation is done via rack state machine.
| } | ||
| } | ||
|
|
||
| /// A combination of DPU and host that was discovered via Site Exploration |
| /// The Switch's BMC IP | ||
| pub bmc_ip: IpAddr, | ||
| // Host mac address | ||
| pub nv_os_mac_addresses: Vec<MacAddress>, |
There was a problem hiding this comment.
switches have 2 management ports and thereby 2 MACs.
I'm however wondering if we need this field here, or whether it's already part of ExplorationReport
| if !created { | ||
| txn.commit() | ||
| .await | ||
| .map_err(|e| DatabaseError::new("commit create_managed_switch", e))?; | ||
| return Ok(false); | ||
| } | ||
|
|
||
| txn.commit() | ||
| .await | ||
| .map_err(|e| DatabaseError::new("commit create_managed_switch", e))?; | ||
|
|
||
| Ok(true) |
There was a problem hiding this comment.
This seems equivalent to
txn.commit()
.await
.map_err(|e| DatabaseError::new("commit create_managed_switch", e))?;
Ok(created)| let existing_switch = db::switch::find_by_id(txn, &switch_id).await?; | ||
|
|
||
| if let Some(_existing_switch) = existing_switch { | ||
| //Possibly multiple eth ports are connected? |
There was a problem hiding this comment.
That seems to be desirable. The log makes it sounds like a problem.
In any case, it seems like the check is somewhat duplicated. Just having either the MAC address or switch ID check seems good enough. Switch ID seems best to me.
There was a problem hiding this comment.
i see that
code will never hit
| name, | ||
| enable_nmxc: false, | ||
| fabric_manager_config: None, | ||
| location: Some("US/CA/DC/San Jose/1000 N Mathilda Ave".to_string()), |
There was a problem hiding this comment.
that should be Metadata on the switch?
There was a problem hiding this comment.
location to come from rack placement. Keeping migrated code from mods.rs
| }; | ||
| let config = model::switch::SwitchConfig { | ||
| name, | ||
| enable_nmxc: false, |
There was a problem hiding this comment.
These fields look like something that is set in carbide configuration and could be changed at runtime (opposed to ingestion time)?
There was a problem hiding this comment.
will check. Just wodering if we want to have per rack enable/disable nmc
There was a problem hiding this comment.
we might need it for migrating some existing deployments. But maybe for these we also just wouldn't have Rack + Switch support enabled?
| )); | ||
| } | ||
|
|
||
| if is_switch_reprovisioning_requested(state) { |
There was a problem hiding this comment.
the rack state machine will perform the actual update, and it would wait until all hosts would move out of Assigned/Ready into a guard state.
So this seems ok
| ); | ||
| } | ||
|
|
||
| let switch_handler = Arc::new(SwitchStateHandler::default()); |
There was a problem hiding this comment.
please use the run_single_iteration() flows that are used in all other places - e..g. for host tests. In these we can check things step by step much easier since there's no timing constraints. And you also won't need to construct all these handlers and state controllers another time. They are already part of TestEnv.
A state machine for switch : creation → configuration → validation → BOM validation → ready, with optional reprovisioning and deletion.
States and flow
Description
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes